Abstract

A search engine returns thousands of web pages for a single user query, most of which are
not relevant. In this context, effective information retrieval from the expanding web is a
challenging task, particularly when the query is ambiguous. The central question is how to
obtain the relevant pages for an ambiguous query. We propose an approach that forms community
vectors based on the association concept of data mining, using the vector space model and the
freedictionary. We build clusters by computing the similarity between the community vectors and
the document vectors formed from the web pages returned by the search engine. We use the Gensim
package to implement the algorithm because of its simplicity and robustness. Our analysis shows
that this approach is an effective way to form clusters for an ambiguous query.
1 Introduction

On the web, search engines are key to information retrieval (IR) for any user query. However,
resolving an ambiguous query is a challenging task and hence a vibrant area of research. Because
user queries are short and ambiguous, retrieving information that matches the user's intention
from the large volume of the web is not straightforward. The ambiguity in queries stems from the
short query length, which averages 2.33 terms on a popular search engine [1]. In this context,
Sanderson [2] reports that 7%-23% of the queries that occur frequently in two search engines are
ambiguous, with an average length of one term. For example, the familiar word Java is ambiguous
because it has multiple senses, viz. Java coffee, Java island and the Java programming language.
A user query can also contain ambiguities that do not appear on the surface. Because of such
ambiguities, a search engine generally does not understand the context in which the user is
looking for information. Hence, it returns a huge amount of information, in which most of the
retrieved pages are irrelevant to the user. Retrieving this huge amount of heterogeneous
information not only increases the burden on the search engine but also decreases its performance.

In this paper we propose an approach to improve the effectiveness of a search engine by forming
clusters of word senses based on the association concept of data mining, using the vector space
model of Gensim [6] and the freedictionary [13]. The association concept on which the clusters
are formed can be described as follows. Suppose the user queries the word Apple, which is
associated with multiple contexts, viz. computer, fruit, company, etc. Each context associated
with Apple is in turn associated with different word senses; for example, computer is associated
with keyboard, mouse, monitor, etc. Hence computer can be taken as a community vector, or
cluster, whose components (elements) are the associated words keyboard, mouse, monitor, etc.
Here, each element in the cluster represents a sense of the computer vector for apple. So, if a
user is looking for apple as a computer, s/he may look for 'apple keyboard', 'apple mouse' or
'apple monitor'. We use Minipar [16] to transform a complete sentence into a dependency tree and
to classify words and phrases into lexical categories.
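
The association structure described above can be pictured as a small nested mapping. The sketch below is only an illustration: `COMMUNITIES` and `expand_query` are hypothetical names, and the word lists are the examples used in this paper, not actual output of the freedictionary.

```python
# Hypothetical illustration of the association concept: each context
# (community) of an ambiguous word maps to the words associated with it.
COMMUNITIES = {
    "apple": {
        "computer": ["keyboard", "mouse", "monitor"],
        "fruit": ["tree", "vegetable", "juice"],
    }
}


def expand_query(query, community):
    """Return the refined queries a user interested in one sense might issue."""
    return [f"{query} {word}" for word in COMMUNITIES[query][community]]


print(expand_query("apple", "computer"))
```

Each key under "apple" plays the role of a community vector, and its list holds the elements of that community.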

The paper is organized as follows. In Section 2 we examine related work on information
retrieval based on clustering techniques. In Section 3 we briefly discuss the Gensim package used
to implement our approach. In Section 4 we present our approach for effective information
retrieval in the context of a user query. Section 5 contains an analysis of the algorithm.
Finally, Section 6 concludes the paper.

2 Related Work 

Ranking and clustering are the two most popular methods for information retrieval on the web. In
ranking, a model is built from training data so that it can sort new objects according to their
relevance. There are many ranking models [14], which can be roughly categorized as
query-dependent and query-independent models. In the other method, clustering, an unstructured
set of objects is grouped based on the similarity among the objects. One of the most popular
clustering algorithms is the k-means algorithm. However, a problem with this algorithm is that an
inappropriate choice of the number of clusters (k) may yield poor results. In the case of an
ambiguous query, word sense discovery is a useful method for IR, in which the documents in a
corpus are clustered. Patrick et al., 2002 discover word senses by clustering words according to
their distributional similarity. The main drawback of this approach is that it requires a large
amount of training data to form proper clusters, and its performance depends on the cluster
centroid, which changes whenever a new web page is added to the cluster. Hence identifying the
relevant cluster becomes tedious.

Herrera et al., 2010 proposed an approach that uses several features extracted from the document
collection and query logs to automatically identify the user's goal behind a query. This
approach succeeds in classifying queries into categories such as navigational, informational and
transactional (B. J. Jansen et al., 2008) but fails to classify ambiguous queries. Because query
logs are used, it may raise privacy concerns, as long sessions are recorded, and may lead to
ethical issues surrounding the collection of users' data. Lilyaa et al. [15] use statistical
relational learning (SRL) for short ambiguous queries based only on a short glimpse of user
search activity, captured in a brief search session. Much research has been done on mapping user
queries to a set of categories (Powell et al., 2003; Dolin et al., 1998; Yu et al., 2001).
However, all of the above techniques fail to identify the user's intention behind the query.

The Word Sense Induction method (Roberto Navigli et al., 2010) is a graph-based clustering
algorithm in which snippets are clustered based on a dynamic and finer-grained notion of sense.
The approach of Ahmed Sameh et al., 2010 uses a modified Lingo algorithm to identify frequent
phrases as candidate cluster labels and then assigns the snippets to those labels. In this
approach, semantic recognition is provided by WordNet, which enables the recognition of synonyms
in snippets. The clusters formed by these two approaches do not contain all the pages relevant
to the user's choice. Our work differs in two ways. First, it uses the freedictionary together
with the association concept of data mining to form clusters. Second, it can handle the dynamic
nature of the web because Gensim is used. Hence the user's intention behind an ambiguous query
can be identified in a simple and efficient manner.

In 2008, Jiyang Chen et al. proposed an unsupervised approach to cluster results by word sense
communities. Clusters are formed from dependency-based keywords extracted from a large corpus,
and labels are assigned to each cluster manually. In this paper we form the community vector and
eliminate the problem of manual assignment of cluster labels. We use the Gensim package to avoid
the dependency on a large training corpus [5] and for its ease of implementing vector space
models (e.g. LSI, LDA).

3 Gensim 

The Gensim package is a Python library for vector space modeling that aims to process raw,
unstructured digital texts ("plain text"). It can automatically extract semantic topics from
documents and is used mainly by the Natural Language Processing (NLP) community. Its memory (RAM)
use is independent of the corpus size, which allows it to process large web-based corpora. In
Gensim one can easily plug in one's own input corpus and data stream, and other vector space
algorithms can be trivially incorporated.

In Gensim, many unsupervised algorithms are based on word co-occurrence patterns within a
corpus of training documents. Once these statistical patterns are found, any plain text document
can be succinctly expressed in the new semantic representation and queried for topical
similarity against other documents. In addition, it has the following salient features:

• Straightforward interfaces, a scalable software framework, and a low learning curve for the
API and for prototyping.

• Efficient implementations of several popular vector space algorithms: calculation of TF-IDF
(term frequency-inverse document frequency), distributed incremental Latent Semantic Analysis,
and distributed incremental Latent Dirichlet Allocation (LDA).

• I/O wrappers and converters around several popular data formats. 
Vector Space Model:

In the vector space model, each document is defined as a multidimensional vector of keywords
in Euclidean space whose axes correspond to the keywords, i.e., each dimension corresponds to a
separate keyword [4]. The keywords are extracted from the document, and the weight associated
with each keyword determines the importance of that keyword in the document. Thus, a document is
represented as

\[ D_j = (w_{1j}, w_{2j}, w_{3j}, w_{4j}, \ldots, w_{nj}) \]

where $w_{ij}$ is the weight of term $i$ in document $j$, indicating the relevance and
importance of the keyword.
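
As a small illustration of this representation, the sketch below turns a preprocessed document into a sparse list of (token ID, count) pairs, the form that is later weighted by TF-IDF. The keyword order follows the appendix example; the name `to_sparse_vector` is hypothetical, and plain Python is used here rather than Gensim.

```python
from collections import Counter

# Keywords of the appendix example; the list index serves as the token ID.
KEYWORDS = ["apple", "computer", "tree", "keyboard", "mouse",
            "juice", "country", "vegetable", "fruit", "monitor"]
TOKEN_ID = {word: idx for idx, word in enumerate(KEYWORDS)}


def to_sparse_vector(tokens):
    """Map a list of keywords to sorted (token ID, count) pairs."""
    counts = Counter(TOKEN_ID[t] for t in tokens if t in TOKEN_ID)
    return sorted(counts.items())


d1 = "apple computer keyboard apple tree country".split()
print(to_sparse_vector(d1))
```

Only the dimensions with a non-zero count are stored, which keeps the vector short even when the keyword vocabulary is large.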

TF-IDF Concept: TF measures how often a word appears in a document, and IDF measures the rarity
of a word within the search index. The combined TF-IDF score measures the statistical strength
of a given word with reference to the query. Mathematically,

\[ TF_i = \frac{n_i}{\sum_k n_k} \]

where $n_i$ is the number of occurrences of the considered term $i$ and $\sum_k n_k$ is the
number of occurrences of all terms in the given document,

\[ IDF_i = \log \frac{N}{df_i} \]

where $N$ is the total number of documents and $df_i$ is the number of documents that contain
term $i$, and

\[ \text{TF-IDF}_i = TF_i \times \log \frac{N}{df_i} \]
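
The calculation can be sketched in plain Python on the four preprocessed sample documents of the appendix (Table 2). This is a minimal sketch under two assumptions: N is taken as the number of documents, and the logarithm is base 10, which is what the IDF values in Table 4 suggest (0.602 for a keyword found in one of four documents). The function names are hypothetical. The appendix tables appear to round TF down to two decimals (for example 0.16 for 1/6), so some printed weights can differ slightly from Table 5.

```python
import math

# Preprocessed sample documents (Table 2 of the appendix).
DOCS = {
    "D1": "apple computer keyboard apple tree country".split(),
    "D2": "vegetable tree apple tree".split(),
    "D3": "apple mouse mouse apple".split(),
    "D4": "apple juice fruit apple monitor".split(),
}


def tf(term, tokens):
    """Occurrences of `term` divided by the number of terms in the document."""
    return tokens.count(term) / len(tokens)


def idf(term, docs):
    """log10(N / df); assumes `term` occurs in at least one document."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log10(len(docs) / df)


def tfidf(term, doc_id, docs=DOCS):
    """TF-IDF weight of `term` in the document named `doc_id`."""
    return tf(term, docs[doc_id]) * idf(term, docs)


for term, doc_id in [("mouse", "D3"), ("tree", "D2"), ("apple", "D1")]:
    print(term, doc_id, round(tfidf(term, doc_id), 4))
```

Because apple occurs in every document, its IDF is log(4/4) = 0, so the queried word itself contributes nothing to the document vectors.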

Cosine Similarity Measure: This technique measures the similarity between a document and the
query. The angle ($\theta$) between the document vector and the query vector determines the
similarity between the document and the query, and it is written as

\[ \cos\theta = \frac{\sum_j w_{q,j}\, w_{i,j}}{\sqrt{\sum_j w_{q,j}^2}\ \sqrt{\sum_j w_{i,j}^2}} \qquad (1) \]

where $w_{q,j}$ and $w_{i,j}$ are the weights of keyword $j$ in the query and in document $i$,
and $\sqrt{\sum_j w_{q,j}^2}$ and $\sqrt{\sum_j w_{i,j}^2}$ are the lengths of the query and
document vectors respectively.

If $\theta = 0^\circ$ then the document and the query are similar. As $\theta$ changes from
$0^\circ$ to $90^\circ$, the similarity between the document and the query decreases, i.e. $D_2$
will be more similar to the query than $D_1$ if the angle between $D_2$ and the query is smaller
than the angle between $D_1$ and the query.
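
A minimal sketch of the cosine measure on sparse vectors is given below. Vectors are stored as {token ID: weight} dictionaries, which is an assumption of this sketch and not the Gensim data structure. The sample weights are those of document D3 and community C1 in the appendix, and returning 0 for an empty vector is a convention chosen here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as {token ID: weight}."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


# Weights taken from the appendix: document D3 and community C1 (computer).
d3 = {4: 0.301}
c1 = {1: 0.1505, 3: 0.1505, 4: 0.301, 9: 0.1505}
print(round(cosine_similarity(d3, c1), 5))
```

For these two vectors the similarity evaluates to about 0.75593, which agrees with the entry for D3 and C1 in Table 9.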
4 Our Approach

Our approach for an ambiguous query is described below in five steps and depicted in the flow
chart (Fig. 1).

1. Web page extraction and preprocessing: Submit the ambiguous query to a search engine and
extract the top n pages. Preprocess the retrieved corpus as follows:

• Remove the stop words and unwanted words.

• Select nouns as the keywords from the corpus using Minipar [16] and ignore other categories,
such as verbs, adjectives, adverbs and pronouns.

• Perform stemming using the Porter algorithm [12].

• Save each of the n processed pages as a document $D_k$, where k = 1, 2, 3, ..., n.

2. Document vectors: Compute the TF and IDF scores for all the keywords of each $D_k$ using
Gensim and form the document vectors of all the retrieved pages.

3. Cluster formation: We use the freedictionary with the option "start with" to form the
community vectors of the queried word as follows:

• Submit the ambiguous query (say apple) to the freedictionary and preprocess the retrieved
data, i.e. remove the queried word, stop words and unwanted words. After stemming, save all the
nouns as keywords ($W_j$) in a file $F_c$, where j = 1, 2, 3, ..., m.

• Now submit each $W_j$ again to the freedictionary, preprocess the retrieved data and save
the nouns as keywords, along with the queried word, in a community file $F_{W_j}$.

• Search for all the words of $F_{W_j}$ in $D_k$ using regular expression search.

• Delete those words in $F_{W_j}$ which are not present in $D_k$.

• Each $W_j$ then forms a community vector (cluster) whose elements are the words saved in
the file $F_{W_j}$.

• Compute TF-IDF for each word in $F_{W_j}$ with respect to $D_k$ to obtain the weights of the
community vectors.

4. Similarity check: Compute the cosine similarities between the document vectors and the
community vectors using eq. 1.

5. Assignment of documents to the clusters: Assign each document to the cluster with which it
has maximum similarity.
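
The pruning part of step 3 can be sketched as follows. This is a simplified illustration under stated assumptions: the dictionary text in `lookup` is invented for the example, the stop-word list is a tiny placeholder, and noun selection with Minipar and Porter stemming are omitted. The names `preprocess` and `prune_community` are hypothetical.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "or"}  # tiny illustrative list


def preprocess(text, query):
    """Lowercase, tokenize, and drop stop words and the queried word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and t != query]


def prune_community(candidates, documents):
    """Keep only candidate words that occur as whole words in some document."""
    kept = []
    for word in candidates:
        pattern = re.compile(r"\b" + re.escape(word) + r"\b")
        if any(pattern.search(doc) for doc in documents):
            kept.append(word)
    return kept


# Invented dictionary text for the sense "computer" of the query "apple".
lookup = "A computer is used with a keyboard, a mouse, a monitor and a printer"
candidates = preprocess(lookup, query="apple")
documents = [
    "apple computer keyboard apple tree country",
    "apple mouse mouse apple",
    "apple juice fruit apple monitor",
]
print(prune_community(candidates, documents))
```

Words returned by the dictionary but absent from every retrieved document (here, for instance, printer) are dropped, so the community file keeps only terms that can contribute to the similarity with the document vectors.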
[Flow chart: ambiguous query; extraction and preprocessing of the top n pages; extraction of
nouns as keywords into separate documents $D_k$; computation of TF and IDF to form document
vectors; FreeDictionary lookup to build $F_c$ and $F_{W_j}$; deletion of words of $F_{W_j}$ not
present in $D_k$; computation of TF and IDF to form community vectors; similarity computation;
assignment of documents to the cluster with maximum similarity.]

Figure 1: An effective IR for an ambiguous query



5 Test Results 

To illustrate our approach we took four sample documents, as shown in Table 1. We preprocessed
the documents and extracted ten keywords (apple, computer, tree, keyboard, mouse, juice, country,
vegetable, fruit, monitor) from the sample (Table 2). After a token ID is assigned to each
selected keyword (Table 3), TF and IDF are computed, as shown in Table 4. Table 5 gives the
computed weights (TF-IDF) for all four sample documents. From the calculated weights and the
respective token IDs, the document vectors are generated (Table 6).

The community vectors are formed as described in Section 4 (Table 7), and the corresponding TF,
IDF and weights are calculated (Table 8). Cosine similarity is calculated as defined by eq. 1.
The similarity between each community vector (C1, C2) and each of the document vectors (D1, D2,
D3 and D4) is computed, and each document joins the cluster of the community vector with which
it has maximum similarity. From our experimental result, we found that (D1, D3) and (D2, D4) are
associated with C1 and C2 respectively, i.e. two clusters are generated (Tables 9 and 10).

As an example, Table 10 shows that if the user searches for the ambiguous word apple, s/he will
get two clusters, C1 and C2, containing the most relevant documents.
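
The final assignment step can be reproduced with a short sketch that takes the document vectors of Table 6 and the community weights of Table 8 as given inputs. The dictionary layout and the names `cosine` and `assign_clusters` are assumptions of this sketch. Similarities recomputed this way may differ from Table 9 in the last decimals, but the resulting clusters are the same.

```python
import math

# Document vectors (Table 6) and community weights (Table 8) as {token ID: weight}.
DOCS = {
    "D1": {1: 0.09632, 2: 0.04816, 3: 0.09632, 6: 0.09632},
    "D2": {2: 0.1505, 7: 0.1505},
    "D3": {4: 0.301},
    "D4": {5: 0.1204, 8: 0.1204, 9: 0.1204},
}
COMMUNITIES = {
    "C1": {1: 0.1505, 3: 0.1505, 4: 0.301, 9: 0.1505},
    "C2": {2: 0.22575, 5: 0.1505, 7: 0.1505, 8: 0.1505},
}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0 if either vector is empty."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def assign_clusters(docs, communities):
    """Assign every document to the community with the highest cosine similarity."""
    clusters = {name: [] for name in communities}
    for doc_id, vec in docs.items():
        best = max(communities, key=lambda name: cosine(vec, communities[name]))
        clusters[best].append(doc_id)
    return clusters


print(assign_clusters(DOCS, COMMUNITIES))
```

With these inputs, D1 and D3 fall into C1 (computer) and D2 and D4 into C2 (fruit), matching Table 10.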



6 Conclusion 

For an ambiguous query, we propose an effective approach for IR that forms clusters of relevant
web pages. For cluster formation we use the standard vector space model and the freedictionary.
With our approach, the user's intention behind an ambiguous query can be identified. This
unsupervised approach not only handles the corpus by extracting and analyzing significant terms,
but also forms the desired clusters for a real-time query. In future work we would extend our
approach to multi-word queries and improve the clusters using ranking techniques.
Appendix

D1   apple computer released new wireless keyboard and apple trees are more in our country.
D2   all vegetables trees are different from apple trees.
D3   the apple mouse is a multi-button USB mouse manufactured and sold by apple Inc.
D4   apple as juice or fruit is very tasty and apple launch new LED monitor.

Table 1: Sample documents taken for experiment.



D1   apple computer keyboard apple tree country
D2   vegetable tree apple tree
D3   apple mouse mouse apple
D4   apple juice fruit apple monitor

Table 2: Documents after preprocessing.
Keyword     Token ID
apple       0
computer    1
tree        2
keyboard    3
mouse       4
juice       5
country     6
vegetable   7
fruit       8
monitor     9

Table 3: Keywords & respective token IDs.



Keyword     D1   TF1    D2   TF2    D3   TF3   D4   TF4   IDF
apple       2    0.33   1    0.25   2    0.5   2    0.4   -
computer    1    0.16   -    -      -    -     -    -     0.602
tree        1    0.16   2    0.5    -    -     -    -     0.301
keyboard    1    0.16   -    -      -    -     -    -     0.602
mouse       -    -      -    -      2    0.5   -    -     0.602
juice       -    -      -    -      -    -     1    0.2   0.602
country     1    0.16   -    -      -    -     -    -     0.602
vegetable   -    -      1    0.25   -    -     -    -     0.602
fruit       -    -      -    -      -    -     1    0.2   0.602
monitor     -    -      -    -      -    -     1    0.2   0.602

Table 4: Calculation of TF-IDF for each document.
Keyword     D1        D2       D3      D4
apple       0         0        0       0
computer    0.09632   0        0       0
tree        0.04816   0.1505   0       0
keyboard    0.09632   0        0       0
mouse       0         0        0.301   0
juice       0         0        0       0.1204
country     0.09632   0        0       0
vegetable   0         0.1505   0       0
fruit       0         0        0       0.1204
monitor     0         0        0       0.1204

Table 5: Weight: TF x IDF.



Documents   Corresponding Document Vectors
D1          [(0, 0), (1, 0.09632), (2, 0.04816), (3, 0.09632), (4, 0),
            (5, 0), (6, 0.09632), (7, 0), (8, 0), (9, 0)]
D2          [(0, 0), (1, 0), (2, 0.1505), (3, 0), (4, 0), (5, 0),
            (6, 0), (7, 0.1505), (8, 0), (9, 0)]
D3          [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0.301), (5, 0), (6, 0),
            (7, 0), (8, 0), (9, 0)]
D4          [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0.1204), (6, 0),
            (7, 0), (8, 0.1204), (9, 0.1204)]

Table 6: Representation of document as vectors.
Community Vector   Associated Keywords                  [(ID, Frequency)]
Computer (C1)      computer, keyboard, mouse, monitor   [(1, 1), (3, 1), (4, 2), (9, 1)]
Fruit (C2)         fruit, tree, vegetable, juice        [(8, 1), (2, 3), (7, 1), (5, 1)]

Table 7: Community vectors formed from communities as [ID, Frequency].



                                                   Weight = TF x IDF
Keyword     C1   TF C1   C2   TF C2   IDF     C1       C2
apple       -    -       -    -       -       -        -
computer    1    0.25    -    -       0.602   0.1505   -
tree        -    -       3    0.75    0.301   -        0.22575
keyboard    1    0.25    -    -       0.602   0.1505   -
mouse       2    0.5     -    -       0.602   0.301    -
juice       -    -       1    0.25    0.602   -        0.1505
country     -    -       -    -       0.602   -        -
vegetable   -    -       1    0.25    0.602   -        0.1505
fruit       -    -       1    0.25    0.602   -        0.1505
monitor     1    0.25    -    -       0.602   0.1505   -

Table 8: TF-IDF calculation for community vector.



Document/Community   C1        C2        Resultant Cluster (Max(C1, C2))
D1                   0.41939   0.18165   C1 (Computer)
D2                   0.0       0.77149   C2 (Fruit)
D3                   0.75593   0.0       C1 (Computer)
D4                   0.21828   0.5041    C2 (Fruit)

Table 9: Similarity between each community and document is tabulated.



Query (apple)   Community (sense)   Cluster
Cluster 1       Computer            D1, D3
Cluster 2       Fruit               D2, D4

Table 10: Final clustering of relevant documents.